Algorithm 10 Search process of DCP-NAS

Input: Training data, validation data
Parameter: Searching hyper-graph $G$, $M = 8$, $e(o^{(i,j)}_m) = 0$ for all edges
Output: Optimized $\hat{\alpha}$

1:  while DCP-NAS do
2:      while Training real-valued Parent do
3:          Search a temporary real-valued architecture $p(w, \alpha)$.
4:          Decoupled optimization from Eqs. 4.43 to 4.53.
5:          Generate the tangent direction $\partial \tilde{f}(w) / \partial \alpha$ from Eqs. 4.21 to 4.29.
6:      end while
7:      Architecture inheriting $\hat{\alpha} \leftarrow \alpha$.
8:      while Training 1-bit Child do
9:          Calculate the learning objective from Eqs. 4.26 to 4.32.
10:         Tangent propagation from Eqs. 4.33 to 4.41 and decoupled optimization from Eqs. 4.43 to 4.53.
11:         Obtain $\hat{p}(\hat{w}, \hat{\alpha})$.
12:     end while
13:     Architecture inheriting $\alpha \leftarrow \hat{\alpha}$.
14: end while
15: return Optimized architecture $\hat{\alpha}$.
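To make the control flow of Algorithm 10 concrete, the following is a minimal Python sketch of the alternating Parent/Child loop. It is a sketch only: `dcp_nas_search`, `search_step`, `decoupled_step`, `tangent_direction`, `learning_objective`, and `tangent_propagation` are hypothetical placeholder names that mirror the steps of the algorithm, not functions from a released DCP-NAS codebase.

```python
# Hypothetical sketch of Algorithm 10's control flow; not an official DCP-NAS implementation.
# `parent` and `child` stand for the real-valued Parent p(w, alpha) and the 1-bit Child
# p_hat(w_hat, alpha_hat); all method names below are placeholders.

def dcp_nas_search(parent, child, train_data, val_data, rounds, parent_steps, child_steps):
    for _ in range(rounds):                                    # outer "while DCP-NAS do" loop
        # Train the real-valued Parent.
        for _ in range(parent_steps):
            parent.search_step(train_data)                     # temporary real-valued architecture p(w, alpha)
            parent.decoupled_step(val_data)                    # decoupled optimization (Eqs. 4.43-4.53)
            tangent = parent.tangent_direction()               # tangent direction d f~(w)/d alpha (Eqs. 4.21-4.29)

        child.alpha = parent.alpha                             # architecture inheriting: alpha_hat <- alpha

        # Train the 1-bit Child.
        for _ in range(child_steps):
            loss = child.learning_objective(train_data, tangent)  # learning objective (Eqs. 4.26-4.32)
            child.tangent_propagation(loss)                    # tangent propagation (Eqs. 4.33-4.41)
            child.decoupled_step(val_data)                     # decoupled optimization (Eqs. 4.43-4.53)

        parent.alpha = child.alpha                             # architecture inheriting: alpha <- alpha_hat

    return child.alpha                                         # optimized architecture alpha_hat
```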

where $\odot$ represents the Hadamard product and $\eta = \eta_1 \eta_2$. We take $\psi_t = \big[\sum_{m=1}^{M}\sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}, \cdots, \sum_{m=1}^{M}\sum_{e=1}^{E} \tilde{g}_e \frac{\partial w_m}{\partial \alpha_{e,m}}\big]^T$. Note that $\frac{\partial w}{\partial \alpha}$ has no explicit form in NAS and cannot be solved directly, which makes $\psi_t$ unsolvable as well. Thus we introduce a learnable parameter $\tilde{\psi}_t$ to approximate $\psi_t$, whose back-propagation process is calculated as

$$\tilde{\psi}_{t+1} = \left|\, \tilde{\psi}_t - \eta_\psi \frac{\partial L}{\partial \tilde{\psi}_t} \,\right|. \tag{4.51}$$
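To illustrate, the update in Eq. 4.51 is ordinary gradient descent on $\tilde{\psi}$ followed by an element-wise absolute value. Below is a minimal PyTorch sketch, assuming $\tilde{\psi}_t$ is stored in a tensor `psi_tilde` and using a toy loss in place of the actual DCP-NAS objective; all names are illustrative.

```python
import torch

# Illustrative sketch of Eq. 4.51: psi_tilde_{t+1} = | psi_tilde_t - eta_psi * dL/d(psi_tilde_t) |.
# `psi_tilde`, `eta_psi`, and the toy loss are placeholders, not from an official codebase.

psi_tilde = torch.randn(8, requires_grad=True)   # learnable approximation of psi_t (size chosen arbitrarily)
eta_psi = 1e-3                                   # learning rate for the psi update

loss = psi_tilde.pow(2).sum()                    # stand-in for the DCP-NAS learning objective L
loss.backward()                                  # compute dL / d(psi_tilde)

with torch.no_grad():
    psi_tilde.copy_(torch.abs(psi_tilde - eta_psi * psi_tilde.grad))  # Eq. 4.51
    psi_tilde.grad.zero_()                       # clear the gradient for the next step
```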

Eq. 4.50 shows that our method is based on a projection function that solves the optimization coupling problem through the learnable parameter $\tilde{\psi}_t$. In this method, we consider the influence of $\alpha^{t}$ and backtrack the optimized state at the $(t+1)$-th step to form $\tilde{\alpha}^{t+1}$. However, the key point in optimization is where and when the backtracking should be applied. Thus, we define the update rule as

$$\tilde{\alpha}^{t+1}_{:,m} =
\begin{cases}
P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}), & \text{if } \mathrm{ranking}(R(w_m)) > \tau \\
\alpha^{t+1}_{:,m}, & \text{otherwise}
\end{cases} \tag{4.52}$$

where $P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}) = \alpha^{t+1}_{:,m} + \eta\, \tilde{\psi}_t \odot \alpha^{t}_{:,m}$ and the subscript $\cdot_{m}$ denotes a specific edge. $R(w_m)$ denotes the norm constraint of $w_m$ and is further defined as

$$R(w_m) = \|w_m\|_2^2, \qquad m = 1, \cdots, M, \tag{4.53}$$
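Putting Eqs. 4.52 and 4.53 together: edges are ranked by the squared $\ell_2$ norm of their weights, and only the edges whose rank exceeds $\tau$ are backtracked through the projection $P$. The following is a minimal PyTorch-style sketch of this rule under stated assumptions: the function and variable names are illustrative, $\tilde{\psi}_t$ is assumed to match the length of each $\alpha_{:,m}$, the ranking direction (larger norm gets higher rank) is an assumption of the sketch, and the threshold `tau` is passed in as a parameter (its value is defined right after this sketch in Eq. 4.54).

```python
import torch

def backtrack_alpha(alpha_new, alpha_old, weights, psi_tilde, eta, tau):
    """Hypothetical sketch of the backtracking rule in Eqs. 4.52-4.53 (not an official implementation).

    alpha_new : (K, M) tensor holding alpha^{t+1} (K candidate operations, M edges)
    alpha_old : (K, M) tensor holding alpha^{t}
    weights   : list of M weight tensors, one per edge
    psi_tilde : (K,) tensor, the learnable approximation of psi_t
    eta       : scalar, eta = eta_1 * eta_2
    tau       : rank threshold (e.g. epsilon * M, see Eq. 4.54)
    """
    # R(w_m) = ||w_m||_2^2 for every edge m (Eq. 4.53).
    norms = torch.stack([w.pow(2).sum() for w in weights])

    # ranking(R(w_m)): rank 1 for the smallest norm, rank M for the largest
    # (the ranking direction is an assumption of this sketch).
    ranking = torch.empty_like(norms)
    ranking[norms.argsort()] = torch.arange(1, len(weights) + 1, dtype=norms.dtype)

    alpha_tilde = alpha_new.clone()
    for m in range(len(weights)):
        if ranking[m] > tau:
            # P(alpha^{t+1}_{:,m}, alpha^t_{:,m}) = alpha^{t+1}_{:,m} + eta * psi_tilde (Hadamard) alpha^t_{:,m}
            alpha_tilde[:, m] = alpha_new[:, m] + eta * psi_tilde * alpha_old[:, m]
    return alpha_tilde
```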

In Eq. 4.52, $\tau$ denotes the threshold for deciding whether or not to backtrack. We further define the threshold as

$$\tau = \epsilon \cdot M, \tag{4.54}$$

where $\epsilon$ denotes a hyperparameter that controls the percentage of edges to backtrack. By backtracking $\alpha$, the supernet can learn to jump out of local minima. The general process of DCP-NAS is described in Algorithm 10. Note that the decoupled optimization can be